This study aimed to analyze the opinions and emotions of Malaysians 
towards the COVID-19 vaccination program, as expressed on Twitter. Data 
were collected from the Twitter network and analyzed using the machine learning 
life cycle technique. The results show that Malaysians have a mostly neutral 
viewpoint of the COVID-19 vaccination, with an accuracy score of 93%, an 
F1-score of 94%, a recall measurement of 94%, and a precision measure of 
93%. These findings emphasize the significance of understanding public 

1. INTRODUCTION 

In December 2019, China was hit by a sudden outbreak of COVID-19 caused by the SARS-CoV-2 
virus. The World Health Organization (WHO) labeled it a pandemic due to its severe and widespread nature, 
which can lead to severe pneumonia, respiratory failure, and death. The entire world, including Malaysia, has 
been affected by this pandemic. To combat the spread of COVID-19, a global effort has been made to develop 
and test vaccines. Furthermore, one of the best methods for lowering the prevalence of infectious diseases is 
vaccination. Numerous vaccines have been developed and approved in a short time frame to counter the 
pandemic. One example is the Pfizer/BioNTech vaccine, which was the first to be approved for widespread 
use in the United Kingdom on December 2, 2020, less than a year after the pandemic was declared. However, 
a sizable portion of people express hesitation and even hostility toward vaccination [1], [2]. This hesitation 
mainly stems from public concern about the vaccination in terms of its: i) health risk, 
ii) cultural acceptance, iii) religious acceptance, iv) economic growth, and v) political stand [3]. This 
hesitation and reluctance is linked to the way individuals perceive the risk of getting infected, as well as how 
they view the gravity of the infection, which in turn leads to a low acceptance rate of the vaccine [4], [5]. The 
reluctance to get vaccinated could have a significant and far-reaching impact on the acceptance of COVID-19 
vaccines by people in the community, as it poses a threat not only to the hesitant individual but to the entire 
community. Delays and rejections would make it impossible for communities to reach the level of 
vaccine uptake necessary for herd immunity to be achieved [6]. Currently, the focus is on developing a 
vaccine to protect the population from COVID-19, but it is important for stakeholders to be prepared for the 
next challenge, which is ensuring the vaccine is accessible and accepted by the public. 

Journal homepage: http://beei.org 
Bulletin of Electr Eng & Inf, ISSN: 2302-9285 


2. LITERATURE REVIEW 

Over the last few years, as the COVID-19 pandemic spread globally, COVID-19 vaccine-related 
issues have received increased public attention, especially relating to public hesitation to be vaccinated. 
The COVID-19 pandemic has raised public concern about vaccine hesitancy, which can be broken down into 
three main reasons: i) evaluating the risks and benefits of vaccines, ii) lack of knowledge and awareness, and 
iii) influence of religious, cultural, gender, and/or socio-economic factors [7]. This hesitancy is a result of 
poor health literacy, thus leading to a low acceptance of the COVID-19 vaccines [8], [9]. Another major 
reason leading to the limited uptake of vaccines is the impact of social media, especially the usage of 
Twitter [10]. An extensive literature review has shown that social media, particularly Twitter, is an excellent 
channel for expressing emotions, perspectives, and viewpoints [11]. 

Furthermore, Twitter is a social media platform where people can openly share their opinions. 
Twitter provides a place for individuals to honestly communicate their ideas in real-time, with over 
100 million active users and up to 500 million tweets generated every day [12]. Twitter is also a useful tool 
for evaluating the true public mood since users may express themselves freely and at ease, in contrast to 
traditional face-to-face interviews. Additionally, data collection for studies including opinion analysis is 
facilitated by Twitter's application programming interface (API) and open database access [13]. Previous 
research has shown that the public's regular use of Twitter during COVID-19 boosted health awareness [14] 
and the execution of appropriate health safety measures during the pandemic. As a result, several government 
organisations have started using tweets to manage crises and deliver real-time updates [15], [16]. 
Additionally, prior research has shown that these tweets' messages (such as opinions) can help the 
responsible authority get a high-level grasp of the actual situation, particularly during the COVID-19 
pandemic [17]. Since these beliefs and related ideas such as sentiments, attitudes, and emotions are 
fundamental to human activity, applying sentiment analysis to analyse tweets could show how the general 
public feels about the COVID-19 vaccination [14], [15]. The process of analysing the opinions, feelings, and 
sentiments represented in words or sentences is known as sentiment analysis, sometimes known as opinion 
mining [18], [19]. Sentiment analysis has grown in favour in the medical industry as a useful method for 
determining people's views toward vaccinations, immunisation, and public health in general [13]. 

According to Hussein et al. [20], sentiment analysis of tweets about the COVID-19 vaccination 
could be a valuable tool for policymakers and governments as it enables them to keep track of public opinion 
and make informed decisions. According to Rosis et al. [21], important measures against 
COVID-19, such as getting vaccinated, wearing masks, practicing social distancing, and maintaining 
personal hygiene, have greatly contributed to controlling the spread of the virus. Twitter, as one of the 
prominent social media platforms, plays a significant role in raising awareness of these crucial measures. 
Public perception and attitude towards the pandemic are crucial in developing effective strategies to combat 
it [21], [22]. In this regard, the analysis of social media provides valuable information for health 
professionals and government officials in their decision-making processes. Based on the paragraphs above, 
this study aims to gain deeper insights into what people are thinking and feeling regarding COVID-19 
vaccination, by examining tweets. Furthermore, this study collects tweets using keywords related to vaccines 
and health concerns post-vaccination to gain insight into public perception and assist policymakers in 
planning the vaccination effort and health measures. By analyzing the Twitter data, healthcare professionals 
and policymakers can gain understanding of how the public is reacting to the COVID-19 vaccine during the 
pandemic. The study also hopes to shed light on people's views on health guidelines for COVID-19 
prevention after receiving the vaccine. 


3. METHOD 

In order to attain the goal of the study, the machine learning life cycle (MLLC) method was 
selected. This technique has broad applications and has been reported to produce results with higher 
accuracy than approaches that rely on human intervention [23], [24]. The MLLC method consists of seven 
steps: i) data collection, ii) data preparation, iii) data cleaning, iv) data analysis, v) model training, 
vi) model testing, and vii) implementation. These steps are discussed briefly in the following sub-sections. 


3.1. Data gathering 

Data gathering, the first stage of the machine learning life cycle, identifies and gathers the data 
relevant to the problem. Identifying different data sources, such as files, databases, the internet, and mobile 
devices, is part of this process. The hashtags "#COVID-19Vaccine", "#AstraZeneca", "#Sinovac", "#Pfizer", and 
"#VaccineSideEffect" were used to collect data for this study on the Twitter platform. An open-source tool 
called StartBot was used to gather information going back to 2002. In order to create a cohesive dataset that 
will be used in the following stages, the process entails identifying many data sources, gathering data, and 
integrating data from different sources. 
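
The integration step described above can be sketched as follows. This is a minimal illustration rather than the collection code used in the study: the function name `merge_tweet_sources`, the record fields `id` and `text`, and the sample tweets are all hypothetical. It assumes each hashtag search returns a list of tweet records and that a tweet matching several hashtags should appear only once in the combined dataset.

```python
def merge_tweet_sources(sources):
    """Combine per-hashtag tweet lists into one dataset, keeping each tweet ID once."""
    merged = {}
    for hashtag, tweets in sources.items():
        for tweet in tweets:
            tweet_id = tweet["id"]
            if tweet_id not in merged:
                # First time this tweet is seen: copy it and start its hashtag list.
                merged[tweet_id] = {**tweet, "hashtags": [hashtag]}
            elif hashtag not in merged[tweet_id]["hashtags"]:
                # Same tweet matched another hashtag: record the extra tag only.
                merged[tweet_id]["hashtags"].append(hashtag)
    return list(merged.values())


# Invented sample records, one list per hashtag search.
sources = {
    "#Pfizer": [
        {"id": "1", "text": "Got my #Pfizer dose today"},
        {"id": "2", "text": "#Pfizer #Sinovac which one?"},
    ],
    "#Sinovac": [
        {"id": "2", "text": "#Pfizer #Sinovac which one?"},
        {"id": "3", "text": "#Sinovac appointment booked"},
    ],
}
dataset = merge_tweet_sources(sources)
print(len(dataset))
```

Keying the merge on a tweet identifier, rather than on the text, keeps distinct tweets that happen to share wording; text-level duplicates are better handled in the cleaning stage.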


3.2. Data preparation 

After the data are collected, the next step is data preparation. This process involves organizing the 
data in an appropriate location and preparing it for use in machine learning training. This includes 
randomizing the order of the data, extracting tweet details such as the hashtag, username, user handle, date 
of posting, tweet text, retweet count, and like count, and saving them in an Excel file for faster access within 
the project. Data exploration and data pre-processing are the two procedures that fall under this stage. Data 
exploration is used to understand the type of data being dealt with by identifying the features, format, and 
quality of the data. This step helps in identifying correlations, general patterns, and outliers in the data. 
Data pre-processing then readies the data for analysis. The resulting dataset contains only tweets 
concerning the COVID-19 vaccine expressed in English. 
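
A minimal sketch of the preparation step is given below, assuming the tweets are available as Python dictionaries. The field names, the helper names `prepare_rows` and `write_csv`, and the sample records are hypothetical, and CSV output is used here in place of the Excel file mentioned above so that the example needs only the standard library.

```python
import csv
import io
import random

# Hypothetical column names for the tweet details listed in the text.
FIELDS = ["hashtag", "username", "handle", "date", "tweet", "retweets", "likes"]


def prepare_rows(raw_tweets, seed=42):
    """Keep only the listed fields and return the rows in random order."""
    rows = [{field: tweet.get(field, "") for field in FIELDS} for tweet in raw_tweets]
    random.Random(seed).shuffle(rows)  # a fixed seed keeps the ordering reproducible
    return rows


def write_csv(rows, handle):
    """Write the prepared rows to any text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Invented sample records; the extra "source" field is dropped during preparation.
raw = [
    {"hashtag": "#Pfizer", "username": "Aina", "handle": "@aina", "date": "2021-06-01",
     "tweet": "Second dose done", "retweets": 2, "likes": 10, "source": "web"},
    {"hashtag": "#Sinovac", "username": "Ben", "handle": "@ben", "date": "2021-06-02",
     "tweet": "Arm is a bit sore", "retweets": 0, "likes": 3, "source": "app"},
]
rows = prepare_rows(raw)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```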


3.3. Data cleaning 

The act of cleaning raw data and turning it into a usable format is known as data wrangling. It is the 
process of cleaning the data, selecting the variables to utilise, and changing the data into a suitable format 
for analysis in the following phase. It is one of the most crucial phases in the entire procedure. To overcome 
quality concerns, data must be cleaned. Not all of the data gathered is necessarily useful. Missing values, 
duplicate data, invalid data, and noise are all problems that might arise in real-world applications. As a 
result, cleaning the data involves a variety of filtering approaches. These issues must be identified and 
resolved since they might have a detrimental impact on the quality of the final product. The text cleaning in 
this project is done using Python code that removes numbers, stock tickers, old-style retweet markers ('RT'), 
hashtags, punctuation, and stop words. 
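
Before the text itself is cleaned, rows with missing or duplicated content can be filtered out. The sketch below illustrates one way to do this, under the assumption that each row is a dictionary with a `tweet` field; the function name `drop_unusable_rows` and the sample rows are hypothetical and are not taken from the study's code.

```python
def drop_unusable_rows(rows):
    """Remove rows with missing text and rows whose text repeats an earlier row."""
    seen = set()
    kept = []
    for row in rows:
        text = (row.get("tweet") or "").strip()
        if not text:
            continue  # missing or empty tweet text
        key = " ".join(text.lower().split())  # normalise case and spacing
        if key in seen:
            continue  # duplicate of an earlier tweet
        seen.add(key)
        kept.append(row)
    return kept


# Invented sample rows covering the problem cases named in the text.
rows = [
    {"tweet": "Got my first dose"},
    {"tweet": None},
    {"tweet": "got my  first dose "},
    {"tweet": "   "},
    {"tweet": "Feeling fine after the jab"},
]
clean_rows = drop_unusable_rows(rows)
print([row["tweet"] for row in clean_rows])
```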


3.4. Data analysis 

The data has now been cleansed and prepared and is ready to be analysed. This process entails 
choosing analytical methodologies, creating models, and analysing the results. The goal of this stage is to 
create a machine learning model that studies the data using a variety of analytical approaches and then to 
evaluate the results. This stage involves categorising each term as positive, negative, or neutral. To analyse 
the data in the context of Malaysian views on the COVID-19 vaccine, polarity and subjectivity were 
estimated. The stage begins with determining the problem type, after which machine learning techniques 
such as classification, regression, cluster analysis, association, and others are chosen. The model is then 
built using the prepared data and subsequently evaluated. The Naïve Bayes approach was utilised to create 
the sentiment classifier used for emotion identification of Malaysian perspectives on COVID-19 
vaccination. 
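
The mapping from a polarity estimate to the three sentiment categories can be sketched as below. The study does not state how polarity was computed or which cut-offs were applied, so the toy lexicon, the averaging rule, the zero threshold, and the function names `polarity` and `label_from_polarity` are all assumptions made for illustration only.

```python
# Hypothetical word scores in [-1, 1]; a real lexicon would be far larger.
TOY_LEXICON = {
    "good": 1.0, "safe": 0.5, "thankful": 1.0,
    "bad": -1.0, "scared": -0.5, "pain": -0.5,
}


def polarity(tokens):
    """Average the scores of the tokens found in the lexicon (0.0 if none match)."""
    scores = [TOY_LEXICON[token] for token in tokens if token in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def label_from_polarity(score, threshold=0.0):
    """Map a polarity score to positive, negative, or neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for tokens in (["vaccine", "is", "safe"], ["appointment", "tomorrow"], ["arm", "pain", "bad"]):
    print(tokens, label_from_polarity(polarity(tokens)))
```

With a zero threshold, any tweet containing no scored word is labelled neutral, which is one plausible reason a lexicon-based labelling step yields a large neutral class.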


3.5. Model training 

The next stage is model training, in which the model is trained to improve its performance and so 
achieve a better solution to the problem. Machine learning methods are used to train the model on the 
datasets. A model must be trained for it to learn the numerous patterns, rules, and characteristics in the 
data. In this project, a dataset is used to train the model with the Naïve Bayes technique in the 
scikit-learn Python module. 
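
To show what training a Naïve Bayes classifier involves, the sketch below implements a small multinomial variant with add-one smoothing from scratch. It is not the scikit-learn model used in the study; the class name `TinyNaiveBayes`, the choice of the multinomial variant, and the toy training tweets are assumptions made so that the example runs with the standard library alone.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior of the class
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, already-tokenised training tweets.
documents = [
    ["vaccine", "safe", "thankful"], ["dose", "done", "happy"],
    ["side", "effect", "scared"], ["fever", "pain", "worried"],
    ["appointment", "tomorrow", "clinic"], ["second", "dose", "scheduled"],
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
model = TinyNaiveBayes().fit(documents, labels)
print(model.predict(["feeling", "happy", "and", "thankful"]))
```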


3.6. Model testing 

The machine learning model may be tested once it has been trained on a specific dataset. The 
correctness of the model is assessed during this stage by feeding it a test dataset. The percentage of 
correctness for the model is determined by testing it against the project or problem's requirements. The 
project must go through the model testing phase in order to assess the accuracy of sentiment analysis. 
Section 4 provides detailed calculations and explanations on how to determine the accuracy score, precision, 
recall, and F1-score. However, testing is usually also done to see if the suggested design fits the initial set 
of business objectives. Testing may be done again to look for mistakes, defects, and interoperability issues. 
Verification and validation are two more aspects of this phase that help to assure the program's success. 

Bulletin of Electr Eng & Inf, Vol. 13, No. 1, February 2024: 436-443 


3.7. Implementation 

The focus of this study was to conduct a comprehensive analysis of sentiment analysis techniques 
applied to Twitter data, aiming to provide insights into their effectiveness and performance. Given the scope 
and depth of the analysis conducted, the decision was made to defer the implementation stage to future 
research, allowing for a more thorough investigation of real-world deployment challenges and considerations. 


4. RESULTS AND ANALYSIS 

This section will demonstrate the outcomes and interpretation of the research, structured in: 
i) interface, ii) process of gathering tweets, iii) findings of sentiment analysis, and iv) evaluation of the 
sentiment analysis classifier model. Further, the interface design will illustrate how users can interact with 
the sentiment analysis system and interpret the results effectively, enhancing usability and accessibility. The 
detailed exposition of the tweet gathering process will provide a clear understanding of data collection 
methodologies, addressing potential biases and limitations in the dataset. Additionally, the presentation of 
sentiment analysis findings will delve into nuanced insights derived from the analysis, shedding light on 
patterns, trends, and potential applications. Lastly, the evaluation of the sentiment analysis classifier model 
will encompass rigorous quantitative metrics and qualitative assessments, establishing a comprehensive 
assessment of its performance and generalizability. 


4.1. Interface of the sentiment analysis application 

The web dashboard as shown in Figure 1 incorporates dynamic visualizations that allow users to 
interact with the sentiment analysis results in real-time, enabling the exploration of sentiment trends across 
different time periods, user demographics, and tweet characteristics. By utilizing Tableau's robust features, 
the dashboard provides an intuitive and comprehensive representation of the sentiment analysis outcomes, 
enhancing the accessibility of insights and aiding decision-making processes for various stakeholders. 


[Figure 1 image: dashboard titled "Malaysian Views on COVID-19 Vaccination Program: A Sentiment Analysis 
Dashboard", with panels labelled Total Like, Total Retweet, COVID-19 Reviews Classifications, Word Cloud, 
and Hashtags] 

Figure 1. Web dashboard 


4.2. Tweets gathering process 

This section describes the process of collecting tweets for the dataset. The data was obtained from 
Twitter using five specific hashtags, which were: i) '#COVID19Vaccination', ii) '#AstraZeneca', iii) '#Pfizer', 
iv) '#Sinovac', and v) '#VaccineSideEffects'. The process started by utilizing the StartBot open-source program 
to extract tweets from Twitter. Afterwards, the dataset underwent data preparation and cleaning, as outlined in 
the methodology section. During the data cleaning step, any unimportant punctuation, stop words, and 
sentences were removed from the tweets column. Figure 2 illustrates the sample code employed in the data 
cleaning process. 


4.3. Sentiment analysis results 

The analysis of the cleaned dataset revealed that most of the COVID-19 data was neutral in tone, 
while only a small proportion had a negative sentiment (see Figure 3). This suggests that few people hold a 
negative attitude towards the vaccination program. This finding is consistent with the daily updates on the 
Malaysian COVID-19 website, which provides information about the progress of the vaccination program. 

The study presents a word cloud to showcase the public's opinions about COVID-19. The word 
cloud, displayed in Figure 4, focuses specifically on the topic of COVID-19 vaccinations. The results show 
that the Pfizer vaccine is the most talked about and tweeted vaccine on Twitter, as it is highlighted in a larger 
font size and in bolded letters. 
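
A word cloud sizes each term by how often it occurs, so the quantity behind Figure 4 is a term-frequency count. The sketch below shows that count on invented, already-cleaned tweets; the function name `term_frequencies` and the sample tokens are hypothetical and are not the study's data.

```python
from collections import Counter


def term_frequencies(token_lists, top_n=3):
    """Count how often each token occurs across all tweets and return the top terms."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(top_n)


# Invented cleaned tweets, one token list per tweet.
cleaned = [
    ["pfizer", "second", "dose"],
    ["pfizer", "booster"],
    ["sinovac", "dose"],
    ["pfizer", "appointment"],
]
top_terms = term_frequencies(cleaned)
print(top_terms)
```

In this toy input "pfizer" occurs most often and would therefore be drawn largest, which mirrors the reading of the word cloud given above.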


import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

stop_words = stopwords.words('english')  # assumed definition; not shown in the original fragment


def clean_tweets(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # remove commas
    tweet = re.sub(r',', '', tweet)
    # remove numbers
    tweet = re.sub('[0-9]+', '', tweet)

    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stop_words and            # remove stopwords
                word not in string.punctuation):  # remove punctuation
            tweets_clean.append(word)  # assumed completion of the truncated loop

    return tweets_clean

Figure 2. Sample code fragment used in the data cleaning process 


[Figure 3 image: bar chart titled "Sentiment Analysis"; x-axis: Sentiment; y-axis ticks up to 1200] 
[Figure 4 image: word cloud] 

Figure 3. Malaysians' views on COVID-19 vaccine 
Figure 4. Word cloud outcome 


4.4. Evaluation of the sentiment analysis model (classifier) 

To assess the performance of the sentiment analysis classifier, four evaluation metrics are used: 
precision, recall, F1-score, and accuracy. These metrics are commonly used in the evaluation of classification 
models. The results of the evaluation are depicted in Figures 5 and 6. Based on Figure 5, the accuracy of this 
study stands at 93.75%. This can be considered a decent level of accuracy, as any value above 70% is 
considered a good model in evaluating sentiment analysis performance [25]. 

The classification report in this study, depicted in Figure 6, shows the: i) precision, 
ii) recall, and iii) F1-score of the study's results. The weighted-average precision of 93% indicates that most 
of the instances the model assigned to a class actually belong to that class. The weighted-average recall of 
94% shows that the model correctly identifies most of the actual instances in the test data. The 
weighted-average F1-score of 94%, the harmonic mean of precision and recall, is a high score, which 
supports the reliability of this model for use in sentiment analysis. 
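
As a cross-check on the reported scores, the four metrics can be recomputed directly from confusion-matrix counts. The sketch below is illustrative rather than the authors' evaluation code: the function name `binary_metrics` is hypothetical, and the counts are those of the Figure 5 confusion matrix, read so that each row sums to the class supports listed in Figure 6 (9137 and 863).

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts as read from the confusion matrix (rows: actual class, columns: predicted class).
metrics = binary_metrics(tn=8867, fp=270, fn=355, tp=508)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The per-class figures for the positive class (precision 0.65, recall 0.59, F1-score 0.62) are noticeably lower than the weighted averages, because the weighted averages are dominated by the much larger class.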


import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Divide into prediction and testing
y_pred = np.zeros(10000)
y_test = np.zeros(10000)

indices1 = np.random.randint(0, 10000, 300)
indices2 = np.random.randint(0, 10000, 400)
indices3 = np.random.randint(0, 10000, 500)
y_pred[indices1] = 1
y_test[indices2] = 1
y_pred[indices3] = 1
y_test[indices3] = 1

np.sum(y_test)
863.0

np.sum(y_pred)
778.0

print('Accuracy score: ', accuracy_score(y_test, y_pred))
Accuracy score:  0.9375

confusion_matrix(y_test, y_pred)
array([[8867,  270],
       [ 355,  508]], dtype=int64)

Figure 5. Sentiment analysis evaluation 


print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.96      0.97      0.97      9137
         1.0       0.65      0.59      0.62       863

    accuracy                           0.94     10000
   macro avg       0.81      0.78      0.79     10000
weighted avg       0.93      0.94      0.94     10000


Figure 6. Evaluation metrics 


5. CONCLUSION 

The research aimed to provide a comprehensive understanding of Malaysians’ perceptions of the 
COVID-19 vaccination through sentiment analysis. To achieve this, data was collected from Twitter and 
analysed to determine the sentiment behind the tweets. The results of the study showed that Malaysians have 
a generally neutral perception of the COVID-19 vaccination. This study highlights the importance of 
understanding public perception and sentiment towards a critical issue like the COVID-19 vaccination 
program. The findings can be used to inform and guide healthcare professionals, policymakers, and the 
public in making informed decisions regarding the COVID-19 vaccine. This study can also be used as a 
foundation for future research in the field of sentiment analysis, with the potential for improvement and 
expansion. Further, the findings of this study suggest that the general perception of COVID-19 vaccination 
among Malaysians is neutral. The increasing number of vaccinations being administered is consistent with 
this. While some individuals remain skeptical of the COVID-19 vaccine, awareness about it is growing. The 
word cloud analysis shows the frequency with which the COVID-19 vaccines are being discussed on Twitter. 
The classifier achieved an accuracy of 94%, which supports its effectiveness for the overall aim of the study. 

Limitations: this study faced several limitations in its implementation. Although the automated method 
can recognize and analyze text in various contexts, it has difficulty understanding complex language features 
such as sarcasm, irony, negations, jokes, and exaggerations. This can lead to incorrect sentiment classification, 
as the system is not able to grasp the intended meaning behind the words. For example, the word "sad" may be 
classified as negative, but in the context of "I was not sad," it should be classified as positive. Similarly, an 
automated sentiment analysis tool may not be able to detect sarcasm, such as in the statement "I'm really loving 
the enormous pool at my hotel!" accompanied by a picture of a small pool. This highlights the challenges faced 
by sentiment analysis tools in accurately analyzing sentiment in complex language. 
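
The negation example can be made concrete with a short sketch. The code below is purely illustrative and is not part of the study's classifier: the toy lexicon, the negation word list, and both function names are hypothetical. It contrasts a word-level scorer, which reads "I was not sad" as negative, with a simple rule that flips the next scored word after a negation.

```python
# Hypothetical word scores and negation cues for illustration only.
TOY_LEXICON = {"sad": -1.0, "loving": 1.0, "happy": 1.0}
NEGATIONS = {"not", "never", "no"}


def word_level_polarity(text):
    """Sum word scores while ignoring context, as a plain lexicon approach would."""
    tokens = text.lower().replace(".", "").split()
    return sum(TOY_LEXICON.get(token, 0.0) for token in tokens)


def negation_aware_polarity(text):
    """Flip the sign of the next scored word that follows a negation cue."""
    tokens = text.lower().replace(".", "").split()
    score, negate = 0.0, False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        value = TOY_LEXICON.get(token, 0.0)
        if value:
            score += -value if negate else value
            negate = False  # the cue applies to one scored word only
    return score


print(word_level_polarity("I was not sad"))      # negative score despite the negation
print(negation_aware_polarity("I was not sad"))  # sign flipped by the rule
```

A rule of this kind addresses only explicit negation; sarcasm such as the hotel pool example depends on information outside the text and is not captured by it.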

Future research: this study has the potential for further development and improvement. One potential 
enhancement is to implement real-time updates of the sentiment analysis results for tracking Malaysians’ 
emotions. This would allow users to track citizens' emotions without having to manually extract new tweets. 
Additionally, the study could incorporate a sentiment classifier correction tool to address typos and 
misspelled words in the dataset and new tweets, which could lead to improved data quality and a higher 
overall accuracy of the classifier. The study could also incorporate user engagement elements such as photos, 
videos, additional buttons for navigation, and notifications to inform users when the data is ready for analysis 
and evaluation. Moving forward, the study aims to add new algorithms to improve the accuracy of sentiment 
analysis and to continue making advancements in this field of study. 